OpenMP is a system independent set of compiler directives and library routines that arranges automatic parallel processing of shared memory data when more than one processor is available. This option is provided in the latest Microsoft C++ compilers. The benchmark executes the same functions, using the same data sizes, as the CUDA Graphics GPU Parallel Computing Benchmark, with varieties compiled for 32 bit and 64 bit operation, using old style i387 floating point instructions and more recent SSE code. A run time Affinity option is available to execute the benchmark on a selected single processor.
The benchmarks demonstrate a near doubling of performance, using dual core processors, when not limited by memory speed and when the source code is compatible with Single Instruction Multiple Data (SIMD) operation. All that is needed for the speed increase is an extra directive in the source code (implying parallelise this) and a compilation parameter. Later tests show up to four times faster speeds using a quad core processor.
Potential performance gains from hardware SIMD with SSE instructions are not realised, due to compiler limitations, and this enhances the comparative benefit of CUDA GPU parallel processing. On the other hand, the benchmark, compiled for 64 bit working, demonstrates significant speed improvement using the eight additional SSE registers that are available. It also appears that certain compiler optimisation options (like loop unrolling) cannot be implemented when using OpenMP.
The benchmarks identify three slightly different numeric results on tests using SSE, old i387 and CUDA floating point instructions. Results output has been revised to provide more detail.
Other benchmarks have been converted to run using OpenMP and are described in OpenMP Speeds.htm. Observations are that performance with smaller data arrays can be extremely poor, due to high startup overheads, and wrong numeric results can be produced with careless use of OpenMP directives.
The benchmarks can be downloaded via OpenMPMflops.zip. No installation is necessary - Extract All and click on OpenMP32MFLOPS.exe or OpenMP64MFLOPS.exe, but see ReadMe.txt first. The ZIP file also includes the C++ source code.
The OpenMP tests have also been ported to 32-Bit and 64-Bit Linux using the supplied GCC compiler (all free software) - see linux benchmarks.htm and linux openmp benchmarks.htm, and download benchmark execution files, source code and compile and run instructions in linux_openmp.tar.gz. Using Windows, the file may download wrongly as linux_openmp.tar.tar, but is fine when renamed to linux_openmp.tar.gz.
See
GigaFLOPS Benchmarks.htm
for further details and results, including comparisons with MP MFLOPS, a threaded C version, CUDA MFLOPS, for GeForce graphics processors, and Qpar MFLOPS, where Qpar is Microsoft's proprietary equivalent of OpenMP and faster via Windows. The benchmarks and source codes can be obtained via
gigaflops-benchmarks.zip.
OpenMP is a system independent set of compiler directives and library routines that arranges parallel processing of shared memory data. This option is provided in the latest Microsoft C++ compilers.
In this case, the 32 bit and 64 bit compiler versions used were from the free
Windows Driver Kit Version 7.0.0.
For OpenMP, Microsoft Visual C++ 2008
Redistributable Packages for x86 and x64
were also downloaded.
For comparison purposes, the OpenMP benchmarks execute the same functions as the CUDA tests - see
Benchmark Details.
The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 floating point operations per data word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words.
All that is required to arrange for the code to be run on more than one CPU is a simple directive:
  #pragma omp parallel for
  for(i=0; i < n; i++)
      x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
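As an illustration, below is a minimal self contained sketch of the two operations per word test. This is not the benchmark source itself - the array size, constants and pass count are assumed values for the example:

  #include <stdio.h>
  #include <omp.h>

  #define WORDS  1000000              /* 4 byte single precision words */
  #define PASSES 250

  static float x[WORDS];

  int main(void)
  {
      int    i, pass;
      float  a = 0.1f, b = 0.999999f; /* example constants only */
      double start, secs;

      for (i = 0; i < WORDS; i++) x[i] = 0.5f;

      start = omp_get_wtime();
      for (pass = 0; pass < PASSES; pass++)
      {
          #pragma omp parallel for    /* divide iterations between CPUs */
          for (i = 0; i < WORDS; i++)
              x[i] = (x[i] + a) * b;  /* 1 add + 1 multiply per word */
      }
      secs = omp_get_wtime() - start;

      printf("%8.0f MFLOPS, first result %8.6f\n",
             2.0 * WORDS * PASSES / secs / 1000000.0, x[0]);
      return 0;
  }

Compiled with the /openmp parameter (or -fopenmp for GCC), the loop iterations are divided between the available processors; without it, the pragma is ignored and the program runs on one CPU.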
There are some issues with the Microsoft compilers that limit performance. Using SSE instructions, the hardware registers can each contain four data words, permitting, for example, four simultaneous adds - Single Instruction Multiple Data (SIMD) operation. The compilers appear to generate only single data instructions (SISD), operating on 32 bits out of the 128 bits provided.
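For comparison, below is a hand coded sketch of what full SIMD usage would look like on the two operations per word calculation, using SSE intrinsics. This is illustrative only, assuming the array is 16 byte aligned and n is a multiple of four:

  #include <xmmintrin.h>                        /* SSE intrinsics */

  void triad_simd(float *x, int n, float a, float b)
  {
      __m128 va = _mm_set1_ps(a);               /* a in all four lanes */
      __m128 vb = _mm_set1_ps(b);
      int i;
      for (i = 0; i < n; i += 4)                /* n assumed multiple of 4 */
      {
          __m128 vx = _mm_load_ps(&x[i]);       /* four words per register */
          vx = _mm_mul_ps(_mm_add_ps(vx, va), vb); /* 4 adds, 4 multiplies */
          _mm_store_ps(&x[i], vx);
      }
  }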
With processors running at 64 bits, the old i387 instructions are not available and SSE types have to be used, but more registers are available for optimisation. The 64 bit version of the benchmark at least demonstrates more than one floating point result per CPU clock cycle (linked add and multiply?).
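As an indication, the 3423 MFLOPS single CPU result in the log file below, on a Core 2 Duo measured at 2402 MHz, represents around 1.4 floating point operations per clock cycle.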
Results are provided for an Athlon 64 X2 with XP x64, Core 2 Duo processors using 32-Bit and 64-Bit Vista, a Phenom II X4 via 64-Bit Windows 7 and a Core i7 again using Windows 7. On one CPU of a 2.4 GHz Core 2 Duo, up to 3.5 GFLOPS is produced or 6.8 GFLOPS using both processors. Corresponding results for a four processor 3 GHz Phenom II are 3.7 and 14.5 GFLOPS.
The quad CPU Core i7 results are difficult to interpret. The first issue is that Hyperthreading is available, where 8 threads can be run at the same time, and this could have some impact even with purely floating point calculations. The main problem is Turbo Boost where, using a single CPU, the processor can run much faster than its rated MHz. Even four processors can run faster than the rating if not too hot. Results provided are for two 2.8 GHz i7 processors with different Turbo Boost speeds of up to 3.066 GHz and 3.466 GHz.
At 32 bits, the latest compilers refuse to obey the /arch:SSE parameter and produce only i387 floating point instructions. The ZIP file contains SSE32MFLOPS.exe, a single processor version, produced for SSE operation via an earlier compiler. Some results are given below.
The benchmark can be downloaded via OpenMPMflops.zip. No installation is necessary - Extract All and click on OpenMP32MFLOPS.exe or OpenMP64MFLOPS.exe, but see ReadMe.txt first. The ZIP file also includes the C++ source code.
The benchmarks have run time parameters to change the number of words used and repeat passes that might need adjusting for timing purposes. There is also an option to select a single processor via an Affinity setting. BAT files containing examples of run time parameters are in the ZIP file.
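As an alternative to the built in Affinity option, recent versions of Windows (Windows 7, for example) can restrict a program to chosen CPUs from the standard command prompt, the parameter being a hexadecimal mask of allowed processors, for example:

  start /affinity 1 OpenMP64MFLOPS.exe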
The CUDA graphics parallel computing benchmark
has three lots of tests where two do not involve transferring data to and/or from the host CPU's memory. The tests here can be compared with the CUDA "Data in & out" test. Below is a sample log file for the 64 Bit version on a 2.4 GHz Core 2 Duo via Vista. The second results are for a single selected CPU.
 64 Bit OpenMP MFLOPS Benchmark 1 Fri Oct 02 10:21:19 2009
 Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64

  Test            4 Byte  Ops/  Repeat   Seconds  MFLOPS      First  All
                   Words  Word  Passes                      Results  Same

 Data in & out    100000     2    2500  0.194304    2573   0.929538  Yes
 Data in & out   1000000     2     250  0.193139    2589   0.992550  Yes
 Data in & out  10000000     2      25  0.415691    1203   0.999250  Yes
 Data in & out    100000     8    2500  0.312285    6404   0.957117  Yes
 Data in & out   1000000     8     250  0.335818    5956   0.995517  Yes
 Data in & out  10000000     8      25  0.473814    4221   0.999549  Yes
 Data in & out    100000    32    2500  1.488048    5376   0.890211  Yes
 Data in & out   1000000    32     250  1.891056    4230   0.988082  Yes
 Data in & out  10000000    32      25  1.185456    6748   0.998796  Yes

 64 Bit OpenMP MFLOPS Benchmark 1 Fri Oct 02 10:21:31 2009
 Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64
 Single CPU Affinity 1

  Test            4 Byte  Ops/  Repeat   Seconds  MFLOPS      First  All
                   Words  Word  Passes                      Results  Same

 Data in & out    100000     2    2500  0.313641    1594   0.929538  Yes
 Data in & out   1000000     2     250  0.317088    1577   0.992550  Yes
 Data in & out  10000000     2      25  0.431107    1160   0.999250  Yes
 Data in & out    100000     8    2500  0.584243    3423   0.957117  Yes
 Data in & out   1000000     8     250  0.594728    3363   0.995517  Yes
 Data in & out  10000000     8      25  0.605958    3301   0.999549  Yes
 Data in & out    100000    32    2500  2.268676    3526   0.890211  Yes
 Data in & out   1000000    32     250  2.261049    3538   0.988082  Yes
 Data in & out  10000000    32      25  2.270906    3523   0.998796  Yes

 Hardware Information
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
  Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow

 Windows Information
  AMD64 processor architecture, 2 CPUs
  Windows NT Version 6.0, build 6002, Service Pack 2
  Memory 4095 MB, Free 2854 MB
  User Virtual Space 8388608 MB, Free 8388560 MB
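The MFLOPS figures can be checked from the other columns as words x operations per word x repeat passes / seconds. For the first line above, that is 100000 x 2 x 2500 / 0.194304 = 2573 MFLOPS.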
Following are results of the single processor SSE test, the 32 bit i387 OpenMP benchmark, the 64 bit SSE OpenMP version and MFLOPS obtained using CUDA. The latter are given both for tests that copy data between host RAM and graphics RAM and for graphics processor calculations without external data transfers. Systems used are a Core 2 Duo with 64-Bit Vista and an AMD Athlon 64 X2 using XP x64, followed by a Pentium 4 with 32-Bit XP and a Core 2 Duo laptop with 32-Bit Vista. Later results are for a quad core Phenom II CPU using 64-Bit Windows 7 and a much faster graphics card. Even later are for a quad core Intel i7 processor, with a top end graphics card and, again, using 64-Bit Windows 7. This processor can use Hyperthreading and appears to Windows as having eight CPUs. Latest results are for a dual core Core i5 that also has Hyperthreading.
Single, Dual and Quad CPUs - Appropriate performance gains are obvious on increasing the number of calculations per memory access. With two calculations per word there can be little gain using more than one CPU, as performance is limited by main memory speed. Some results on the AMD Athlon CPU reflect the smaller, slower L2 cache.
CPU and GPU - Particularly as the compiler used does not fully implement SSE SIMD instructions, the GPU CUDA operations can be attractively fast, the latest results showing up to 24 times faster using a GTX 480.
SSE and i387 - Again because of compiler limitations, the old i387 floating point instructions can produce comparable performance in some cases.
32 and 64 Bit SSE - Faster performance using a 64-Bit compilation could be expected, due to the availability of more registers for optimisation, but this is not always the case. Examination of actual intermediate machine code instructions can provide an explanation (see below).
Hyperthreading - This does not appear to raise maximum throughput of the four core i7 to much more than four times the single CPU rate. Using one or two threads, the processors are likely to be running at the Turbo Boost speed of 3066 MHz, but falling back to 2933 MHz with four threads (or 2800 MHz if hot), reducing relative performance.
For more details and Hyperthreading results with other benchmarks see
Quad Core 8 Thread.htm.
Core i7 930 2.8 GHz, increased by Turbo Boost up to 3.066 GHz using 1 CPU
and up to 2.933 GHz using 4 CPUs - Windows 7 64 - MFLOPS

                                                             CUDA     CUDA
     Data  Ops/     SSE    i387    i387  SSE 64b  SSE 64b  GeForce   No I/O
    Words  Word   1 CPU   1 CPU 4/8 CPU    1 CPU  4/8 CPU   GTX480   GTX480

   100000     2    3567    1248    4455     1574     4001      521     5554
  1000000     2    3529    1420    5433     1861     4919      819    21493
 10000000     2    2388    1364    3038     1735     3076     1014    31991
   100000     8    4655    2337    8798     3794    14581     2058    20129
  1000000     8    4642    2413    9813     4149    17080     3306    82132
 10000000     8    4453    2436    9581     4011    12457     4057   125413
   100000    32    3328    2957   12020     4324    16786     7768    52230
  1000000    32    3329    3011   12339     4436    17599    13190   254306
 10000000    32    3307    3003   12432     4418    17576    16077   425237


Phenom II X4 3.0 GHz, Windows 7 64 - MFLOPS

                                                             CUDA     CUDA
     Data  Ops/     SSE    i387    i387  SSE 64b  SSE 64b  GeForce   No I/O
    Words  Word   1 CPU   1 CPU   4 CPU    1 CPU    4 CPU   GTS250   GTS250

   100000     2    3552    1920    5587     1822     5613      328     3054
  1000000     2    3268    1919    5585     1870     7056      625     9672
 10000000     2    1861    1625    2993     1563     2972      714    13038
   100000     8    4535    2115    7763     3637    12653     1336    12233
  1000000     8    4341    2108    7975     3709    14518     2382    39481
 10000000     8    4141    2100    8062     3543    11273     2949    51199
   100000    32    4012    2566    9675     3652    14092     5142    36080
  1000000    32    3981    2552   10091     3663    14510     9427   108170
 10000000    32    3941    2510    9902     3633    14034    11182   135041


Core 2 Duo 2.4 GHz, Vista 64 - MFLOPS

                                                             CUDA     CUDA
     Data  Ops/     SSE    i387    i387  SSE 64b  SSE 64b  GeForce   No I/O
    Words  Word   1 CPU   1 CPU   2 CPU    1 CPU    2 CPU   8600GT   8600GT

   100000     2    2524    1599    2660     1594     2573      215     1770
  1000000     2    2353    1617    2957     1577     2589      342     3479
 10000000     2    1158    1180    1136     1160     1203      417     3874
   100000     8    3647    2063    3948     3423     6404      886     6931
  1000000     8    3445    2070    3624     3363     5956     1371    13250
 10000000     8    3231    2058    3962     3301     4221     1661    14281
   100000    32    2590    2653    4909     3526     5376     3329    16583
  1000000    32    2659    2658    4580     3538     4230     5019    27027
 10000000    32    2663    2649    5183     3523     6748     5975    28923


Core i5-2467M 1.6 GHz to 2.3 GHz Turbo Boost,
Dual Core + Hyperthreading, Windows 7 - MFLOPS

     Data  Ops/     SSE    i387  SSE 64b
    Words  Word   1 CPU   2 CPU    2 CPU

   100000     2    1611     975     1613
  1000000     2    2247    2100     1917
 10000000     2    1625    1603     1681
   100000     8    2829    2621     3524
  1000000     8    3248    2756     3604
 10000000     8    3458    2844     5377
   100000    32    3308    3691     4032
  1000000    32    3330    3994     4178
 10000000    32    3322    4898     5041


AMD Athlon 64 X2 2.2 GHz, XP x64 - MFLOPS

                    A64     A64     A64      A64      A64
     Data  Ops/     SSE    i387    i387  SSE 64b  SSE 64b
    Words  Word   1 CPU   1 CPU   2 CPU    1 CPU    2 CPU

   100000     2    1304    1060    1961     1114     2015
  1000000     2     659     639     812      638      817
 10000000     2     665     640     837      636      831
   100000     8    2084    1495    2922     1942     3783
  1000000     8    1853    1369    2629     1692     3058
 10000000     8    1861    1376    2701     1706     3110
   100000    32    2488    1852    3428     1731     3254
  1000000    32    2439    1813    3614     1793     3369
 10000000    32    2443    1818    3629     1774     3443


32 Bit Windows - MFLOPS

   CPU          Pentium 4     Core 2 Duo Laptop     Atom Netbook
                  P4     P4    C2D    C2D    C2D   Atom   Atom   Atom
   MHz          1900   1900   1829   1829   1829   1600   1600   1600
                XP32   XP32    V32    V32    V32   XP32   XP32   XP32

     Data  Ops/  SSE   i387    SSE   i387   i387    SSE   i387   i387
    Words  Word 1 CPU 1 CPU  1 CPU  1 CPU  2 CPU  1 CPU  No HT     HT

   100000     2   221    223   1811   1201   2063    264    175    323
  1000000     2   224    224    673    650    630    259    185    311
 10000000     2   204    206    651    668    650    258    189    331
   100000     8   835    742   2648   1558   2773    409    257    460
  1000000     8   817    699   2326   1529   2568    406    263    443
 10000000     8   764    771   2331   1508   2645    406    265    475
   100000    32  1160   1017   1935   1978   3627    457    369    679
  1000000    32  1163   1025   1970   1977   3719    456    371    679
 10000000    32  1165   1029   2015   1921   3727    456    372    677

 Single processor Atom i387 results are with Hyperthreading off and on.
Following are OpenMP benchmark results for the version compiled for 64 bit working, with performance gains shown when using multiple processors. These gains are lowest using 10M words (40 MB) with an add and a multiply for each word read, limited by RAM speed. There is generally no such limitation with 32 operations per word at any of the data sizes.
These results include those for two 2.8 GHz Core i7 CPUs that have different Turbo Boost characteristics. In this case, the i7 860 had been detuned and, based on results with 32 operations per word, single CPU tests suggest that both were running at around 3 GHz, with Core i7/Core 2 measured speed ratios similar to MHz ratios (3066/2400 = 4510/3530). The i7 860 has faster RAM, affecting tests with fewer operations per word.
64 Bit OpenMP Benchmark MFLOPS

                       Athlon 64 X2                Core 2 Duo
     Data  Ops/  SSE 64b  SSE 64b  Gain   SSE 64b  SSE 64b  Gain
    Words  Word    1 CPU    2 CPU           1 CPU    2 CPU

   100000     2     1114     2015   1.8      1594     2573   1.6
  1000000     2      638      817   1.3      1577     2589   1.6
 10000000     2      636      831   1.3      1160     1203   1.0
   100000     8     1942     3783   1.9      3423     6404   1.9
  1000000     8     1692     3058   1.8      3363     5956   1.8
 10000000     8     1706     3110   1.8      3301     4221   1.3
   100000    32     1731     3254   1.9      3526     5376   1.5
  1000000    32     1793     3369   1.9      3538     4230   1.2
 10000000    32     1774     3443   1.9      3523     6748   1.9

                        Phenom II                  Core i7 860                Core i7 930
     Data  Ops/  SSE 64b  SSE 64b  Gain   SSE 64b  SSE 64b  Gain   SSE 64b  SSE 64b  Gain
    Words  Word    1 CPU    4 CPU           1 CPU    4 CPU           1 CPU    4 CPU

   100000     2     1822     5613   3.1      1661     4263   2.6      1574     4001   2.5
  1000000     2     1870     7056   3.8      1922     5142   2.7      1861     4919   2.6
 10000000     2     1563     2972   1.9      1824     3838   2.1      1735     3076   1.8
   100000     8     3637    12653   3.5      3939    13804   3.5      3794    14581   3.8
  1000000     8     3709    14518   3.9      4251    18082   4.3      4149    17080   4.1
 10000000     8     3543    11273   3.2      4133    15079   3.6      4011    12457   3.1
   100000    32     3652    14092   3.9      4438    16299   3.7      4324    16786   3.9
  1000000    32     3663    14510   4.0      4512    18081   4.0      4436    17599   4.0
 10000000    32     3633    14034   3.9      4493    17752   4.0      4418    17576   4.0

 i7 860 2.8 GHz, Turbo Boost possible to 3.47 GHz using 1 CPU, to 2.93 GHz using 4
 i7 930 2.8 GHz, Turbo Boost possible to 3.07 GHz using 1 CPU, to 2.93 GHz using 4
The benchmarks were compiled using the /Fa option, which produces a file containing an assembly code listing. The listings show significant differences between 64 bit and 32 bit compilations, and also depending on whether the /openmp parameter is included.
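For reference, the listings can be reproduced with a command of the following form, where the source file name is assumed for illustration:

  cl /O2 /openmp /Fa openmpmflops.cpp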
The most obvious difference is when using two operations per word, where the 32 bit compilation unrolls the loop (using x[i], x[i+1], x[i+2] and x[i+3], with four times as many calculations per loop pass, as in the sketch below). This results in some much faster speeds for the 32 bit version. A further 64 bit compilation, without /openmp, also included unrolling.
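In source code terms, this unrolling of the two operations per word loop is equivalent to the following, illustrative only and ignoring any leftover elements when n is not a multiple of 4:

  for(i=0; i<n; i+=4)
  {
      x[i]   = (x[i]   + a) * b;
      x[i+1] = (x[i+1] + a) * b;
      x[i+2] = (x[i+2] + a) * b;
      x[i+3] = (x[i+3] + a) * b;
  }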
At the other extreme, where 64 bit compilation is much faster, memory accesses are reduced by using the additional registers. In the listings below, these accesses appear as instructions such as addss xmm6, DWORD PTR _g$[esp], and the extra registers as xmm8 to xmm15 (24 registers would really be needed - CUDA has more).
2 Operations Per Word

  for(i=0; i< n; i++) x[i]=(x[i]+a)*b;

 64 Bit SSE Instructions

  $LL6@triad$omp$:
  ; Line 77
     movaps  xmm0, xmm1
     add     rax, 4
     sub     rcx, 1
     addss   xmm0, DWORD PTR [rax-4]
     mulss   xmm0, xmm2
     movss   DWORD PTR [rax-4], xmm0
     jne     SHORT $LL6@triad$omp$

 32 Bit SSE Instructions

  $L56949:
  ; Line 77
     movss   xmm2, DWORD PTR [eax-8]
     addss   xmm2, xmm1
     mulss   xmm2, xmm0
     movss   DWORD PTR [eax-8], xmm2
     movaps  xmm2, xmm1
     addss   xmm2, DWORD PTR [eax-4]
     mulss   xmm2, xmm0
     movss   DWORD PTR [eax-4], xmm2
     movss   xmm2, DWORD PTR [eax]
     addss   xmm2, xmm1
     mulss   xmm2, xmm0
     movss   DWORD PTR [eax], xmm2
     movss   xmm2, DWORD PTR [eax+4]
     addss   xmm2, xmm1
     mulss   xmm2, xmm0
     movss   DWORD PTR [eax+4], xmm2
     add     eax, 16
     dec     edx
     jne     SHORT $L56949

8 Operations Per Word

  for(i=0; i< n; i++) x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;

 64 Bit SSE Instructions

  $LL6@triadplus$:
  ; Line 69
     movss   xmm1, DWORD PTR [rcx]
     add     rcx, 4
     sub     rax, 1
     movaps  xmm2, xmm1
     movaps  xmm0, xmm1
     addss   xmm1, xmm7
     addss   xmm2, xmm3
     addss   xmm0, xmm5
     mulss   xmm1, xmm8
     mulss   xmm2, xmm4
     mulss   xmm0, xmm6
     subss   xmm2, xmm0
     addss   xmm2, xmm1
     movss   DWORD PTR [rcx-4], xmm2
     jne     SHORT $LL6@triadplus$

 32 Bit SSE Instructions

  $L56942:
  ; Line 69
     movss   xmm6, DWORD PTR [eax-8]
     movss   xmm7, DWORD PTR [eax-8]
     addss   xmm7, xmm3
     mulss   xmm7, xmm2
     addss   xmm6, xmm5
     mulss   xmm6, xmm4
     subss   xmm6, xmm7
     movss   xmm7, DWORD PTR [eax-8]
     addss   xmm7, xmm1
     mulss   xmm7, xmm0
     addss   xmm6, xmm7
     movss   DWORD PTR [eax-8], xmm6
     movaps  xmm6, xmm5
     addss   xmm6, DWORD PTR [eax-4]
     mulss   xmm6, xmm4
     movaps  xmm7, xmm3
     addss   xmm7, DWORD PTR [eax-4]
     mulss   xmm7, xmm2
     subss   xmm6, xmm7
     movaps  xmm7, xmm1
     addss   xmm7, DWORD PTR [eax-4]
     mulss   xmm7, xmm0
     addss   xmm6, xmm7
     movss   xmm7, DWORD PTR [eax]
     movss   DWORD PTR [eax-4], xmm6
     movss   xmm6, DWORD PTR [eax]
     addss   xmm7, xmm3
     mulss   xmm7, xmm2
     addss   xmm6, xmm5
     mulss   xmm6, xmm4
     subss   xmm6, xmm7
     movss   xmm7, DWORD PTR [eax]
     addss   xmm7, xmm1
     mulss   xmm7, xmm0
     addss   xmm6, xmm7
     movss   xmm7, DWORD PTR [eax+4]
     movss   DWORD PTR [eax], xmm6
     movss   xmm6, DWORD PTR [eax+4]
     addss   xmm7, xmm3
     addss   xmm6, xmm5
     mulss   xmm7, xmm2
     mulss   xmm6, xmm4
     subss   xmm6, xmm7
     movss   xmm7, DWORD PTR [eax+4]
     addss   xmm7, xmm1
     mulss   xmm7, xmm0
     addss   xmm6, xmm7
     movss   DWORD PTR [eax+4], xmm6
     add     eax, 16
     dec     edx
     jne     $L56942

32 Operations Per Word

  for(i=0; i< n; i++) x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f-(x[i]+g)*h+(x[i]+j)*k
                          -(x[i]+l)*m+(x[i]+o)*p-(x[i]+q)*r+(x[i]+s)*t-(x[i]+u)*v+(x[i]+w)*y;

 64 Bit SSE Instructions

  $LL6@triadplus2:
  ; Line 61
     movss   xmm2, DWORD PTR [rbp]
     add     rbp, 4
     sub     r12, 1
     movaps  xmm0, xmm2
     movaps  xmm1, xmm2
     movaps  xmm3, xmm2
     addss   xmm0, xmm6
     addss   xmm3, xmm4
     addss   xmm1, xmm8
     mulss   xmm0, xmm7
     mulss   xmm3, xmm5
     mulss   xmm1, xmm9
     subss   xmm3, xmm0
     movaps  xmm0, xmm2
     addss   xmm3, xmm1
     addss   xmm0, xmm10
     movaps  xmm1, xmm2
     mulss   xmm0, xmm11
     subss   xmm3, xmm0
     addss   xmm1, xmm12
     movaps  xmm0, xmm2
     mulss   xmm1, xmm13
     addss   xmm3, xmm1
     addss   xmm0, xmm14
     movaps  xmm1, xmm2
     mulss   xmm0, xmm15
     addss   xmm1, DWORD PTR [rax]
     subss   xmm3, xmm0
     movaps  xmm0, xmm2
     mulss   xmm1, DWORD PTR [rcx]
     addss   xmm0, DWORD PTR [rdx]
     addss   xmm3, xmm1
     mulss   xmm0, DWORD PTR [r8]
     movaps  xmm1, xmm2
     addss   xmm1, DWORD PTR [r9]
     subss   xmm3, xmm0
     mulss   xmm1, DWORD PTR [r10]
     movaps  xmm0, xmm2
     addss   xmm0, DWORD PTR [r11]
     addss   xmm2, DWORD PTR [rdi]
     mulss   xmm0, DWORD PTR [rbx]
     mulss   xmm2, DWORD PTR [rsi]
     addss   xmm3, xmm1
     subss   xmm3, xmm0
     addss   xmm3, xmm2
     movss   DWORD PTR [rbp-4], xmm3
     jne     $LL6@triadplus2

 32 Bit SSE Instructions

  $L56934:
  ; Line 61
     movss   xmm5, DWORD PTR [edx+ecx*4]
     addss   xmm5, DWORD PTR _a$[esp]
     mulss   xmm5, DWORD PTR _b$[esp]
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _c$[esp]
     mulss   xmm6, DWORD PTR _d$[esp]
     subss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _e$[esp]
     mulss   xmm6, DWORD PTR _f$[esp]
     addss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _g$[esp]
     mulss   xmm6, DWORD PTR _h$[esp]
     subss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _j$[esp]
     mulss   xmm6, DWORD PTR _k$[esp]
     addss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _l$[esp]
     mulss   xmm6, DWORD PTR _m$[esp]
     subss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _o$[esp]
     mulss   xmm6, DWORD PTR _p$[esp]
     addss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _q$[esp]
     mulss   xmm6, DWORD PTR _r$[esp]
     subss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _s$[esp]
     mulss   xmm6, xmm4
     addss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, xmm3
     mulss   xmm6, xmm2
     subss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, xmm1
     mulss   xmm6, xmm0
     addss   xmm5, xmm6
     movss   DWORD PTR [edx+ecx*4], xmm5
     inc     ecx
     cmp     ecx, edi
     jl      $L56934
The run time display and log files show the numeric results of the calculations; values produced using the same default parameters are shown below. There is some variation in rounding after calculations, with SSE, i387 and CUDA instructions each producing slightly different values (the i387 unit, for instance, carries out intermediate calculations at a higher precision).
  4 Byte  Ops  Repeat       SSE      i387      i387   SSE 64b   SSE 64b   SSE 64b      CUDA
   Words  /Wd  Passes     1 CPU     1 CPU     2 CPU     1 CPU     2 CPU     4 CPU    8600GT

  100000    2    2500  0.929538  0.929475  0.929475  0.929538  0.929538  0.929538  0.929538
 1000000    2     250  0.992550  0.992543  0.992543  0.992550  0.992550  0.992550  0.992550
10000000    2      25  0.999250  0.999249  0.999249  0.999250  0.999250  0.999250  0.999250
  100000    8    2500  0.957117  0.957164  0.957164  0.957117  0.957117  0.957117  0.956980
 1000000    8     250  0.995517  0.995525  0.995525  0.995517  0.995517  0.995517  0.995509
10000000    8      25  0.999549  0.999550  0.999550  0.999549  0.999549  0.999549  0.999549
  100000   32    2500  0.890211  0.890377  0.890377  0.890211  0.890211  0.890211  0.890079
 1000000   32     250  0.988082  0.988102  0.988102  0.988082  0.988082  0.988082  0.988073
10000000   32      25  0.998796  0.998799  0.998799  0.998796  0.998796  0.998796  0.998799